Abstract

This study aimed to investigate the effects of calibration sample size and item bank size on examinee ability 
estimation in computerized adaptive testing (CAT). For this purpose, a 500-item bank pre-calibrated using the 
three-parameter logistic model with 10,000 examinees was simulated. Calibration samples of varying sizes (150,
250, 350, 500, 750, 1,000, 2,000, 3,000, and 5,000) were selected from the parent sample, and item banks that 
represented small (100) and medium size (200 and 300) banks were drawn from the 500-item bank. Items in 
these banks were recalibrated using the drawn samples, and their estimated parameters were used in post-hoc 
simulations to re-estimate ability parameters for the simulated 10,000 examinees. The findings showed that 
ability estimates in CAT are robust against fluctuations in item parameter estimation and that accurate ability 
parameter estimates can be obtained with a calibration sample of 150 examinees. Moreover, a 200-item bank 
pre-calibrated with as few as 150 examinees can be used for some purposes in CAT as long as it has sufficient 
information at targeted ability levels. 
Nearly 50 years of technical research and recent
developments in computer technology have made 
computerized adaptive testing (CAT) applications 
more feasible and affordable for educational 
institutions worldwide. There are numerous 
advantages of using a CAT platform to deliver tests: 
(i) CAT requires less testing time, (ii) the test result
can be calculated immediately, (iii) the test is easier
to deploy and less vulnerable to theft, and (iv) it can
be administered wherever and whenever needed 
(Hambleton & Swaminathan, 1985; Rudner, 1998). 

The success of a CAT program highly depends on 
a large item bank, which is maintained regularly, 
with items distributed across a wide range of ability 
(θ) levels. Such an item bank is necessary to obtain
accurate θ estimates for examinees whose latent
trait will be estimated. However, the preparation of 
such a bank entails some challenges. One challenge, 
possibly the most important, is that items that will 
be placed in a CAT item bank must be pretested and 
calibrated on the same scale. Highly accurate item 
parameters are desired because θ estimates in CAT
applications are based on these parameters. A critical 
variable confounds item parameter estimation at this 
stage: the size of the examinee sample that will be 
used to pretest items in the bank. 

Sample Size Requirements in Item Response 
Theory-Based Item Calibration 

The item parameter calibration process for a CAT 
item bank is conducted using item response theory 
(IRT) models. IRT typically requires large sample 
sizes for accurate item parameter estimation 
(Hambleton, 1989). This is largely based on a 
previous study by Lord (1968), who concluded 
that the standard errors of item discrimination 
parameters were very high until a test of 50 
items and a sample of 1,000 examinees was used 
in the three-parameter logistic model (3PLM), 
and later studies that were concerned with the 
calibration sample size also supported Lord’s 
finding. Swaminathan and Gifford (1979) found 
that a sample of 1,000 examinees was necessary 
to estimate item parameters with high accuracy in 
the 3PLM. Hulin, Lissak, and Drasgow (1982) also 
concluded that a sample of 1,000 was necessary with 
60 items to accurately estimate item parameters in 
the 3PLM. Although Ree and Jensen (1980) stated 
that accurate item parameter estimates require 
only 500 examinees in the 3PLM, with empirical 
support from studies by Patsula and Gessaroli 
(1995); Tang, Way, and Carey (1993); Yen (1987); 
and Yoes (1995), Lord’s (1968) suggestion to use 
1,000 examinees as the minimum item calibration
sample size was accepted by many IRT researchers. 
However, some studies that supported Ree and 
Jensen’s finding that sample sizes less than 1,000 can 
be used without losing much estimation accuracy 
were also published. A study conducted by Gao 
and Chen (2005) found that an item calibration 
sample of 500 can be used to accurately estimate 
item parameters in the 3PLM. Moreover, Weiss 
and von Minden (2012) obtained accurate item 
parameter estimates with a calibration sample of 
200 examinees in the 3PLM. Finally, Akour and
Al-Omari (2013) found that a sample size of 500 was
adequate to accurately estimate item parameters
with 30 items in the 3PLM. It was somewhat
expected that better results could be obtained with a
sample size of 500 after 1995 because more advanced
parameter estimation procedures were being used 
(e.g., marginal maximum likelihood; Baker & Kim, 
2004) compared with those used in 1968. However, 
these studies could not gain sufficient support from 
practitioners, and Lord’s (1968) suggestion to use 
1,000 examinees to estimate item parameters is 
widely followed even today. 

Calibration Sample Sizes and θ Estimation in CAT

Although IRT-based calibration sample size 
studies that focus on item parameter recovery 
have implications for the sizes of samples to be 
used in CAT pre-test item calibrations, it would be 
more useful to determine the effects of calibration 
sample size on θ estimation accuracy in a CAT
environment. Surprisingly, there appear to be only
two studies on calibration sample size and its effects
on θ estimation in CAT.

Ree (1981) conducted a simulation study of 
calibration sample size in adaptive testing, in which 
sample sizes of 500, 1,000, and 2,000 examinees 
and item banks with 100, 200, and 300 items were 
simulated. He calibrated the items in the banks 
with different sample sizes and estimated θ in
fixed-length CAT of 10, 15, 20, 25, 30, and 35 items.
High correlations between true θ and estimated θ
levels were observed when 20 or more items were
administered. In addition, Ree concluded that a
200-item bank calibrated with 2,000 examinees is
required to reduce the absolute error of θ estimation
to acceptable levels in a CAT environment. 

Chuah, Drasgow, and Luecht (2006) studied item 
parameter estimation accuracy for θ estimation in
computerized adaptive sequential tests (CAST) in a 
simulation study and found that items pre-calibrated 
with 300 examinees using the 3PLM can be used to
accurately estimate examinee θ and to classify the
examinees as masters or non-masters using CAST. 


Purpose 

Calibration sample size for CAT programs has 
implications not only for item parameter estimation 
accuracy but also for the start-up and maintenance 
costs of CAT programs. Although it might not 
constitute a serious problem for a test publisher with 
nearly unlimited resources to obtain large examinee 
samples, it might not be a feasible option for educational 
researchers who are working for institutions in 
developing countries. It is arduous for these researchers 
to obtain large examinee groups that embody the 
characteristics of target examinees to pretest items, 
and there has been little research on calibration sample 
size and θ estimation in CAT. Therefore, there is a
need for a study that investigates the feasibility of using
small sample sizes to calibrate items in a bank and
the effects of calibration sample size on θ estimation
in CAT. The present study was designed to satisfy the
abovementioned need. Moreover, because the number
of items in the CAT item bank is another potential
source of high cost in CAT development, the effects of
calibration sample size on θ estimates were investigated
conditional on bank size. For this purpose, an answer
was sought to the question “how do θ estimates based
on item parameter estimates obtained from varying
sample sizes and bank sizes recover the θ estimates
from a large item bank calibrated using a very large 
sample of examinees?” 


Method 

Research Data Generation 

The full dataset of the present study was simulated 
using the 3PLM and a Monte-Carlo simulation 
procedure in CATSim software version 4.0.6 (Weiss 
& Guyer, 2012a) with a uniform distribution of θ
parameters between -3 and +3. The 3PLM is

\[ P_{ij}(\theta_j) = c_i + (1 - c_i) \frac{\exp[D a_i (\theta_j - b_i)]}{1 + \exp[D a_i (\theta_j - b_i)]} \tag{1} \]

where $P_{ij}(\theta_j)$ is the probability of a correct response
to item i conditional on θ for examinee j, $a_i$ is item
discrimination, $b_i$ is item difficulty, and $c_i$ is the
pseudo-chance parameter estimated for item i. The
3PLM considers the chance or guessing parameter; 
it was used in this study because multiple-choice 
items are frequently used in schools, and there is
a certain degree of probability of giving a correct
response to an item by chance in this question type. 
Thus, a model that ignores this chance/guessing 
variable would not have been appropriate for tests 
that use multiple-choice items. 

All item parameters were generated from uniform 
distributions: a parameters ranged from 0.5 to 1.5, b 
parameters ranged from -3 to +3, and c parameters 
ranged from 0.00 to 0.25. As a result, a dataset of 
10,000 examinees and 500 items was obtained. This 
dataset was designed to reflect an operational CAT 
with a 500-item bank that was pre-calibrated using 
10,000 examinees. 
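
The data-generation step described above can be sketched in a few lines. This is only an illustration, not the CATSim procedure itself: the function names are hypothetical, the scaling constant D = 1.7 is an assumption (the text shows D in Equation 1 but does not state its value), and a much smaller dataset is generated to keep the example quick.

```python
import math
import random

D = 1.7  # assumed value of the scaling constant D in Equation 1


def p_correct(theta, a, b, c):
    """3PLM probability of a correct response (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(n_examinees, n_items, seed=0):
    """Draw theta, a, b, c from the uniform ranges given in the text
    and generate a 0/1 response matrix (examinees x items)."""
    rng = random.Random(seed)
    thetas = [rng.uniform(-3.0, 3.0) for _ in range(n_examinees)]
    items = [
        (rng.uniform(0.5, 1.5),    # a: discrimination
         rng.uniform(-3.0, 3.0),   # b: difficulty
         rng.uniform(0.0, 0.25))   # c: pseudo-chance
        for _ in range(n_items)
    ]
    responses = [
        [1 if rng.random() < p_correct(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]
    return thetas, items, responses


# Small illustrative run; the study used 10,000 examinees and 500 items.
thetas, items, responses = simulate_responses(200, 50)
```

Calling `simulate_responses(10000, 500)` would produce a dataset with the dimensions used in the study, although the random draws would of course differ from those of the original simulation.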


Item Selection for Banks of Different Sizes 

From the full bank with 500 items, items for the 
medium (300 and 200 items) and small (100 items) 
banks were selected using a systematic approach 
to maintain the same quality across all banks. 
Using this approach, from the simulated 500-item 
bank (Bank A), a sample of 400 items was drawn 
using the a parameters as strata in SPSS 20’s (IBM
Corporation, 2011) complex samples module.
Then, another 400-item sample was drawn from 
the 500-item bank using the b parameters as strata. 
In this manner, two item samples that reflected 
the distributions of the a and b parameters in the 
500-item bank were obtained. From these item sets, 
items that were common in both samples were taken 
into the final sample of items. There were 310 items
common to both samples of 400 items, 10 more than
the 300-item target. To remove this surplus, the items
were ranked according to
their a parameters first and then according to their 
b parameters. Ten pairs of items with very similar 
a and b parameters were identified, and the items 
with higher c parameters were eliminated from the 
sample, resulting in the 300-item bank (Bank B). To 
select items for the 200-item bank, two sets of 300 
items were drawn from the 500-item bank using 
the a and b parameters as strata. Items that were 
common in both sets were taken into the final set 
of items. Item pairs with similar a and b parameters 
were identified, and those with higher c parameters 
were eliminated, resulting in the 200-item bank 
(Bank C). For the 100-item bank (Bank D), two 
sets of 200 items were drawn from the 500 items 
taking a and b parameters as strata, and the same 
procedure that was used for selecting the items in 
the 300- and 200-item banks was followed. 
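
The intersection-and-trimming logic of this procedure can be sketched as follows. The sketch is an approximation under stated assumptions: the stratified draws themselves are taken as given (two lists of item IDs), and "very similar a and b parameters" is operationalized as the smallest Euclidean distance in the (a, b) plane, which stands in for the ranking-based identification of pairs described above. All function names are hypothetical.

```python
import math


def common_items(sample_by_a, sample_by_b):
    """Item IDs that appear in both stratified draws."""
    return sorted(set(sample_by_a) & set(sample_by_b))


def trim_to_target(items, target):
    """items: dict of item_id -> (a, b, c).
    Repeatedly find the pair of items closest in (a, b) and drop the
    member with the higher c until only `target` items remain."""
    kept = dict(items)
    while len(kept) > target and len(kept) >= 2:
        ids = sorted(kept)
        best = None
        for i, x in enumerate(ids):
            for y in ids[i + 1:]:
                dist = math.hypot(kept[x][0] - kept[y][0],
                                  kept[x][1] - kept[y][1])
                if best is None or dist < best[0]:
                    best = (dist, x, y)
        _, x, y = best
        drop = x if kept[x][2] > kept[y][2] else y
        del kept[drop]
    return kept


# Toy bank: items 1 and 2 are near-duplicates in (a, b); item 2 has the higher c.
bank = {
    1: (1.00, 0.00, 0.10),
    2: (1.01, 0.02, 0.20),
    3: (0.60, -2.00, 0.05),
    4: (1.40, 2.00, 0.15),
}
shared = common_items([1, 2, 3, 4, 7], [2, 1, 4, 3, 9])
final_bank = trim_to_target({i: bank[i] for i in shared}, target=3)
```

In the toy example, item 2 is removed because it nearly duplicates item 1 in a and b but has the higher pseudo-chance parameter.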

During the item selection procedure for 100-, 
200- and 300-item banks, the responses of the 
simulated 10,000 examinees to the selected items 
for smaller banks were kept intact and transferred
to form datasets of 300, 200, and 100 items. In this 
manner, four datasets with 500 (full set), 300, 200, 
and 100 items and 10,000 examinee responses were 
obtained. The drawing of the examinee samples 
that were used to recalibrate items in the 100-, 200- 
and 300-item banks was done based on the datasets 
that had the original (simulated) responses of the 
simulated examinees to the items selected. 

Drawing Calibration Samples 

To have the examinee θ distribution in the full
dataset reflected in the drawn samples, the
examinees’ θ levels were converted into categorical
data by assigning a category number to θs at intervals
of 0.25 (e.g., θ = 3.00...2.75 = 1; θ = 2.749...2.50
= 2); in this manner, 24 discrete θ levels were
obtained. Then, using the θ levels as strata in SPSS
20’s (IBM Corp., 2011) complex samples module,
samples of 150 (Harwell & Janosky, 1991), 250 
(Goldman & Raju, 1986; Harwell & Janosky, 1991), 
500 (Akour & Al-Omari, 2013; Baker, 1998; Gao & 
Chen, 2005; Goldman & Raju, 1986; Hulin et al., 
1982; Thissen & Wainer, 1982), 1,000 (Goldman & 
Raju, 1986; Hulin et al., 1982; Lord, 1968; Thissen 
& Wainer, 1982; Weiss & von Minden, 2012; Yen, 
1987), 2,000 (Gao & Chen, 2005; Hulin et al., 1982; 
Ree & Jensen, 1980; Yoes, 1995), 3,000 (Tang et al., 
1993), and 5,000 (Akour & Al-Omari, 2013) that 
had been tested in previous research (including 
those conducted in one- and two-parameter logistic 
models) on IRT-based calibration sample size as 
well as two uncommon sample sizes (350 and 750) 
were drawn. These samples were drawn from each 
of the datasets with 100, 200, 300, and 500 items 
and 10,000 examinee responses. Therefore, 40 
datasets (36 calibration samples and 4 full datasets) 
were obtained, as summarized in Table 1. 
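
A rough sketch of stratified sampling on 0.25-wide θ categories is given below. It is not the SPSS complex samples procedure: proportional allocation with largest-remainder rounding is an assumption, the function names are hypothetical, and categories are numbered upward from 0 rather than downward from 1 as in the text (the numbering does not affect the strata).

```python
import math
import random
from collections import defaultdict


def theta_category(theta, low=-3.0, width=0.25, n_cat=24):
    """Map theta in [-3, 3] to one of 24 categories of width 0.25."""
    k = int(math.floor((theta - low) / width))
    return min(max(k, 0), n_cat - 1)


def stratified_sample(thetas, n, seed=0):
    """Return sorted examinee indices, sampled proportionally per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, t in enumerate(thetas):
        strata[theta_category(t)].append(idx)
    total = len(thetas)
    # Proportional quotas, rounded down, then top up by largest remainder.
    quotas = {k: n * len(members) / total for k, members in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    shortfall = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:shortfall]:
        alloc[k] += 1
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, alloc[k]))
    return sorted(sample)


# Illustrative population of 10,000 uniform thetas and a 150-examinee sample.
_rng = random.Random(1)
population = [_rng.uniform(-3.0, 3.0) for _ in range(10000)]
sample_150 = stratified_sample(population, 150)
```

Repeating the call with n set to 250, 350, and so on up to 5,000 would yield the nine calibration samples for one bank.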


Item Calibration and θ Estimation Through
Post-Hoc Simulations 

After item selection and sampling, the 3PLM 
parameters of the items in the 36 sample datasets 
were re-estimated with marginal maximum 
likelihood estimation (MMLE) using default options 
in Xcalibre 4.2 (Guyer & Thompson, 2011). The 
estimated item parameters obtained from the samples 
were treated as the known item parameters in post- 
hoc simulations performed in CATSim (Weiss & 
Guyer, 2012a). Post-hoc simulations function as the 
last step before a live CAT administration. They are 
used to evaluate the CAT item bank, giving the CAT 
developer the opportunity to manipulate various 
parameters before a live CAT so that optimal CAT 
application procedures can be obtained. A post-hoc 
simulation requires a matrix of examinee responses 
to items in a CAT item bank and item parameters 
that are known for items in the bank. The simulation 
utilizes the examinee responses to simulate how the 
CAT item bank would function if the examinees 
actually faced items in banks in a live CAT (Weiss 
& Guyer, 2012b). 

In the present study, full datasets with 10,000 
examinees and 500, 300, 200, and 100 items and 
the item parameters calibrated using 36 samples 
were used in the post-hoc simulations. Thus, 36 
CAT simulations were performed, one for each 
combination of sample size and bank size. In these 
simulations, 0.0 was used as the initial θ estimate for
all examinees. Bayesian estimation by maximum a
posteriori was used with a prior mean of 0.0 and standard
deviation of 1.0. Maximum information at the
estimated θ level was used as the item selection rule,
and the CAT was terminated when the standard error
of the θ estimate was 0.20 or less or when all items in
the bank had been used. As a result of the post-hoc
simulations, 36 CAT θ estimates were obtained for
each person in the 10,000-examinee pool. 
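
The CAT rules listed above (initial θ of 0.0, maximum a posteriori estimation, maximum-information item selection, and termination at a standard error of 0.20 or when the bank is exhausted) can be illustrated with a minimal post-hoc simulation for one examinee. This is a sketch, not CATSim's implementation: D = 1.7, a normal prior, a grid search for the posterior mode, and a standard error computed from test information plus prior precision are all assumptions, and the function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant
GRID = [g / 100.0 for g in range(-400, 401)]  # candidate theta values


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """Standard 3PLM item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def map_estimate(administered, prior_mean=0.0, prior_sd=1.0):
    """Posterior mode by grid search; administered = [((a, b, c), u), ...]."""
    def log_post(theta):
        lp = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for (a, b, c), u in administered:
            p = p3pl(theta, a, b, c)
            lp += math.log(p) if u else math.log(1.0 - p)
        return lp
    return max(GRID, key=log_post)


def post_hoc_cat(responses, items, se_stop=0.20, prior_sd=1.0):
    """Replay one examinee's stored responses through an adaptive test."""
    theta, se = 0.0, prior_sd
    used, remaining = [], set(range(len(items)))
    while remaining and se > se_stop:
        # Pick the unused item with maximum information at the current theta.
        nxt = max(remaining, key=lambda i: item_info(theta, *items[i]))
        remaining.remove(nxt)
        used.append(nxt)
        theta = map_estimate([(items[i], responses[i]) for i in used],
                             0.0, prior_sd)
        info = sum(item_info(theta, *items[i]) for i in used) + 1.0 / prior_sd ** 2
        se = 1.0 / math.sqrt(info)
    return theta, se, len(used)


# Toy 30-item bank and two extreme response patterns.
toy_bank = [(1.2, -1.5 + 0.1 * k, 0.2) for k in range(30)]
theta_hi, se_hi, n_hi = post_hoc_cat([1] * 30, toy_bank)
theta_lo, se_lo, n_lo = post_hoc_cat([0] * 30, toy_bank)
```

In the study, the item parameters passed to such a routine would be those re-estimated from a calibration sample, while the responses would come from the full 10,000-examinee dataset.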


Table 1

Item Banks and Samples Drawn.

                                 Bank A         Bank B         Bank C         Bank D
                                 (Simulated)    (Sampled)      (Sampled)      (Sampled)

Number of items                  500            300            200            100

Number of examinees              10,000         10,000         10,000         10,000

Number of calibration            9              9              9              9
samples drawn from               (5,000 x 500,  (5,000 x 300,  (5,000 x 200,  (5,000 x 100,
the bank                         3,000 x 500,   3,000 x 300,   3,000 x 200,   3,000 x 100,
                                 2,000 x 500,   2,000 x 300,   2,000 x 200,   2,000 x 100,
                                 1,000 x 500,   1,000 x 300,   1,000 x 200,   1,000 x 100,
                                 750 x 500,     750 x 300,     750 x 200,     750 x 100,
                                 500 x 500,     500 x 300,     500 x 200,     500 x 100,
                                 350 x 500,     350 x 300,     350 x 200,     350 x 100,
                                 250 x 500,     250 x 300,     250 x 200,     250 x 100,
                                 150 x 500)     150 x 300)     150 x 200)     150 x 100)

Total number of datasets         10             10             10             10

θ Levels Treated as True θ

To obtain the “true” θ levels of the simulated 10,000
examinees, first, item parameters for the simulated 
full dataset (10,000 x 500) were estimated, and a 
post-hoc simulation was administered in CATSim 
using these estimated parameters. Thereafter, 
the θ range of the 10,000 individuals in the full
dataset was found to be between -2.35 and +2.05. 
Estimated item parameters of the full dataset were 
used in this process because all item parameters 
in all research conditions were estimated. Thus, 
the possible discrepancy between the true and 
estimated parameters of the full dataset that were 
attributable to the estimation error caused by the 
parameter estimation software was eliminated. The 
θ estimates obtained after this last simulation were
taken as the true θ levels of the simulated examinees
(Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 
2003) and they were compared with those obtained 
after the 36 post-hoc simulations. 


Evaluation of Estimation Accuracy

To evaluate estimation accuracy, correlations (Gao
& Chen, 2005; Harwell & Janosky, 1991; Hulin et
al., 1982; Yen, 1987) between the CAT θ estimates
that were obtained after the 36 simulations and the
true θ levels were calculated. Moreover, root-mean-
squared difference (RMSD) (Gao & Chen, 2005;
Harwell & Janosky, 1991; Yen, 1987) and average
signed difference (ASD) were also calculated for
these θ estimates using Equations 2 and 3:

\[ \mathrm{RMSD}(\theta) = \sqrt{\frac{\sum_{j=1}^{N} (\hat{\theta}_j - \theta_{Tj})^2}{N}} \tag{2} \]

\[ \mathrm{ASD}(\theta) = \frac{\sum_{j=1}^{N} (\hat{\theta}_j - \theta_{Tj})}{N} \tag{3} \]

where $\hat{\theta}_j$ represents the estimated θ level for the
jth examinee for each research condition tested,
$\theta_{Tj}$ represents the true θ level for that examinee as
defined above, and N is the number of examinees.

Bank Information Functions

Bank information functions (BIFs) indicate how
well examinees at a specific θ level would be
measured if all items in an item bank were used to
estimate θ. Moreover, the amount of information
obtained from an item bank at a specific θ level is
inversely related to the conditional standard error of
measurement (Hambleton & Swaminathan, 1985).
BIFs pertaining to the 500-, 300-, 200-, and 100-item
banks were plotted after item parameters in the full
simulated dataset were estimated. Figure 1 indicates
that item banks obtained with estimated item
parameters had similar BIFs that covered similar θ
levels as desired. Moreover, the highest information
is obtained around θ = 1, and the lowest information
levels are at both ends of the θ continuum.

Figure 1: BIFs for 100-, 200-, 300-, and 500-item banks.

Results 

Correlations between the θ estimates that were
obtained with the 500-, 300-, 200-, and 100-item 
banks and sample sizes varying from 150 to 5,000 
are presented in Figure 2. As shown in the figure, 
the correlations were all over 0.94 regardless of the 
sample size used to calibrate items and the bank size 
employed. Although there was a slight increase in the 
correlations across the sample sizes that were used 
to calibrate items, the correlations obtained ranged 
within a very narrow interval, roughly between 
0.94 and 0.98. Such high correlations indicate 
strong positive linear relations between the true θ
and estimated θ regardless of bank size or number
of examinees. Although correlations remained 
essentially constant or increased with sample size 
for most bank sizes, they slightly decreased as the 
sample size increased in the 100-item bank. 

Figure 2: Correlations between θ estimates for banks with 100,
200, 300, and 500 items and sample sizes of 150-5,000.

The RMSDs and ASDs for the θ estimates are
presented in Figures 3 and 4. As can be seen, there 
was a decreasing trend in RMSDs (Figure 3) as the 
calibration sample size increased (except for the 
100-item bank), as was expected. Moreover, the 
increase in the item bank size from 100 to 500 items 
resulted in a decrease in RMSDs. Differences in 
RMSDs were minimal among the 200-, 300-, and 
500-item banks for sample sizes of 2,000 or less and 
slightly larger for the two largest sample sizes. 

Figure 3: RMSDs for banks with 100, 200, 300, and 500 items
and sample sizes of 150-5,000.

The ASDs shown in Figure 4 fluctuated when the 
sample size increased from 150 to 350 and stabilized 
with sample sizes of 500 or larger. After this size, ASDs 
remained around 0.0 for all sample sizes across different 
bank sizes (except for the 100-item bank). Although 
larger ASDs were obtained for the 100-item bank, the
results still indicated only a small negative bias in θ
estimation (ASDs smaller than 0.05 in absolute value)
for sample sizes of 350 and larger.



Figure 4: ASDs for banks with 100, 200, 300, and 500 items and
sample sizes of 150-5,000.

Correlations Conditional on θ Groups

Overall, Figures 2, 3, and 4 indicate that accurate
θ estimation can be obtained across all sample
sizes and item banks, as suggested by the very high
correlations and low RMSDs and ASDs between the
true θ and estimated θ. However, to better determine
the effects of the item bank and sample size interaction
on θ estimation accuracy in CAT applications
considering the banks’ item information levels,
the correlations, RMSDs, and ASDs
pertaining to these estimates were computed
conditional on the θ continuum divided into five θ
groups (Group 1, θ = -2.35 to -2.00; Group 2, θ =
-1.99 to -1.00; Group 3, θ = -0.99 to 0.00; Group
4, θ = 0.001 to 0.99; Group 5, θ = 1.00 to 2.05). The
numbers of examinees (N) in each θ group were 42,
1,694, 3,012, 3,371, and 1,881 for Groups 1, 2, 3, 4,
and 5, respectively.
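
The conditional analysis can be sketched as follows: examinees are assigned to a group by their true θ, and the correlation, RMSD, and ASD are then computed within each group. The cut points used here (-2.00, -1.00, 0.00, and 1.00) are an assumption that closes the small gaps between the reported group ranges, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def theta_group(theta):
    """Assign a true theta to one of the five groups (assumed cut points)."""
    if theta <= -2.00:
        return 1
    if theta <= -1.00:
        return 2
    if theta <= 0.00:
        return 3
    if theta < 1.00:
        return 4
    return 5


def rmsd(est, true):
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(est, true)) / len(est))


def asd(est, true):
    return sum(e - t for e, t in zip(est, true)) / len(est)


def pearson(x, y):
    """Pearson correlation; None when it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        return None
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def metrics_by_group(est, true):
    """Return {group: (correlation, RMSD, ASD, N)} grouped by true theta."""
    buckets = defaultdict(lambda: ([], []))
    for e, t in zip(est, true):
        buckets[theta_group(t)][0].append(e)
        buckets[theta_group(t)][1].append(t)
    return {g: (pearson(es, ts), rmsd(es, ts), asd(es, ts), len(es))
            for g, (es, ts) in sorted(buckets.items())}


# Toy data: every estimate is 0.1 below the true value.
true_theta = [-2.2, -2.1, -1.5, -1.2, -0.5, -0.1, 0.3, 0.8, 1.2, 1.9]
est_theta = [t - 0.1 for t in true_theta]
results = metrics_by_group(est_theta, true_theta)
```

Note that a within-group correlation is computed over a restricted θ range, so it is expected to be lower than the overall correlation even when RMSD and ASD are small.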

The correlations obtained conditional on the θ groups
are shown in Figure 5; the θ estimates in Group 1
(Figure 5a) were not highly correlated with the true
θ levels of the examinees in this group. Correlations
were 0.60 or less under all conditions of the item 
bank size and sample size with the exception of the 
N = 3,000, 500-item bank condition, in which the 
correlation approached 0.90. For all sample sizes, the 
correlations were somewhat erratic, possibly because 
of the small number of examinees in this group. 

The correlations obtained for the examinees with 
θ levels from -1.99 to -1.00 (Group 2, Figure 5b)
were generally higher and more stable.
However, correlations above 0.70, which indicate
a moderate correlation (Yoes, 1995), were obtained
only with the 500-item bank and
a calibration sample size of 250. The
correlations generally increased as the bank size 
increased from 100 to 500, but differences between 
the correlations were trivial among the 300-, 200-, 
and 100-item banks. For example, the correlations 
obtained from the 300-, 200-, and 100-item banks 
calibrated with 250 examinees were 0.614, 0.614, 
and 0.567, respectively. However, the correlation 
obtained from the 500-item bank in the same 
condition was 0.748. 

The correlations for Group 3 (Figure 5c) ranged 
between 0.611 and 0.803 across the calibration 
samples. The correlations for this θ range were
mostly between 0.70 and 0.80, with a slight 
reduction as the bank size decreased in most 
cases. For example, the correlations obtained from 
items that were calibrated with 350 examinees in 
Group 3 were 0.764, 0.759, 0.740, and 0.697 for 
500-, 300-, 200-, and 100-item banks, respectively. 
This decrease was also observed in samples of 
750, 3,000, and 5,000 and partially observed in 
samples of 250, 500, 1,000, and 2,000. Figure 5c 
shows that the correlations for item banks of 300 
and 500 items were very close to each other, with 
sometimes slightly higher correlations (N = 500 and 
1,000) obtained with the 300-item bank. Slightly 
higher correlations were obtained as the calibration 
sample size increased. 

Figure 5: Correlations conditional on θ groups for banks with
100, 200, 300, and 500 items and samples of 150-5,000.
Panels: 5a, Group 1 (θ = -2.35 to -2.00, N = 42); 5b, Group 2
(θ = -1.99 to -1.00, N = 1,694); 5c, Group 3 (θ = -0.99 to 0.00,
N = 3,012); 5d, Group 4 (θ = 0.001 to 0.99, N = 3,371); 5e,
Group 5 (θ = 1.00 to 2.05, N = 1,881).


The correlations pertaining to Group 4 in Figure 5d 
ranged between 0.698 and 0.899 across calibration 
samples. Slightly lower correlations were obtained 
as the bank size decreased, as was the case in Group 
3, with a larger decrease for the 100-item bank. The 
correlations for the 100-item bank were uniformly 
the lowest and were essentially constant across 
calibration samples. Quite similar correlations
were obtained across the calibration samples for
the other bank sizes, except that the correlations for
the 500-item bank increased at the sample sizes of
3,000 and 5,000.
Higher correlations were obtained for Group 4 than 
for Groups 1, 2, and 3. 

The correlations for Group 5 (high θ group) in
Figure 5e ranged between 0.475 and 0.874. Note
that the correlations obtained from the 300- and
500-item banks were quite similar and moved on
quite similar trajectories across different calibration
sample sizes. There was a linear relation between the
sample size and estimation accuracy for these item
banks. The correlations obtained from the 300- and
500-item banks were higher than those obtained
from the 100- and 200-item banks, which were
essentially identical to each other for this θ range.

RMSDs and ASDs Conditional on θ Groups

The RMSDs and ASDs that were calculated 
conditional on θ groups are presented in Figures 6
and 7. In contrast to the low RMSDs in Figure 3,
the values for θ Group 1 (Figure 6a) were high, with
average RMSDs of 1.22, 1.07, 1.05, and 1.09 in the 
100-, 200-, 300-, and 500-item banks, respectively. 

The RMSDs for Group 2 are presented in Figure 
6b; they were lower than those obtained for Group 
1, but were still high, with average RMSDs of 0.93, 
0.97, 0.97, and 0.92 for the 100-, 200-, 300-, and 
500-item banks, respectively. The correlations for 
this group were moderate in some cases for the 
500-item bank. 

The RMSDs obtained for Group 3 are presented in 
Figure 6c; a substantial decrease was observed. The 
values ranged between 0.37 and 0.67, lower than 
those obtained for Groups 1 and 2. 

The RMSDs for Group 4 are presented in Figure 
6d; they ranged between 0.25 and 0.38. It is clearly 
observed in the figure that highly similar RMSD 
values were obtained across all item banks in all 
sample sizes. Although the RMSDs tended to be 
lower as the sample size increased, the magnitude 
of the change was trivial. 

Figure 6: RMSDs conditional on θ groups for banks with
100, 200, 300, and 500 items and samples of 150-5,000.
Panels: 6a, Group 1 (θ = -2.35 to -2.00, N = 42); 6b, Group 2
(θ = -1.99 to -1.00, N = 1,694); 6c, Group 3 (θ = -0.99 to 0.00,
N = 3,012); 6d, Group 4 (θ = 0.001 to 0.99, N = 3,371); 6e,
Group 5 (θ = 1.00 to 2.05, N = 1,881).

Figure 7: ASDs conditional on θ groups for banks with 100,
200, 300, and 500 items and samples of 150-5,000.
Panels: 7a, Group 1 (θ = -2.35 to -2.00, N = 42); 7b, Group 2
(θ = -1.99 to -1.00, N = 1,694); 7c, Group 3 (θ = -0.99 to 0.00,
N = 3,012); 7d, Group 4 (θ = 0.001 to 0.99, N = 3,371); 7e,
Group 5 (θ = 1.00 to 2.05, N = 1,881).

Şahin, Weiss / Effects of Calibration Sample Size and Item Bank Size on Ability Estimation in Computerized...


The RMSDs obtained for Group 5 are presented in 
Figure 6e. As shown in the figure, the values ranged 
between 0.20 and 0.29, even lower than those 
obtained for Group 4. This indicates highly accurate 
θ estimation across all bank sizes and sample sizes.

The ASDs conditional on θ groups are presented in
Figure 7. The θ estimates for Group 1 (Figure 7a)
had negative ASDs with values ranging between
-1.25 and -0.29. This indicates a large negative
bias in θ estimates for this range, that is,
substantial underestimation of the true θs. The
ASDs for Group 2 (Figure 7b) ranged between
-0.50 and -0.15 and were high, indicating negatively
biased estimates of θ within this range. The ASDs
for Group 3 (Figure 7c) ranged between -0.31 and
-0.01, which indicated low ASDs between the
true θ and estimated θ. As shown in Figure 7c, the
ASDs were essentially the same among item banks
of different sizes across all sample sizes in this
range. The ASDs obtained for Group 4 (Figure 7d)
ranged between -0.22 and 0.12, indicating better
θ estimation in this group than in the previous
three groups. Finally, the ASDs for Group 5 (Figure
7e) ranged between -0.14 and 0.19, which also
indicated slightly biased but accurate θ estimates.

Discussion and Conclusions 

The results for the overall correlations, RMSDs, and 
ASDs that were calculated for each item bank and 
sample size combination indicated that although 
sample size and bank size influenced θ estimation
in CAT, this influence was negligible. However,
when the results from the θ groups were analyzed,
more detailed implications in terms of the effects of
the item bank quality and sample size interaction
on θ estimation in CAT were obtained.

The results for θ Group 1 (θ = -2.35 to -2.00)
indicated that the bank quality was more important
than the calibration sample size for examinee θ
estimation in CAT. This was the only θ range for
which item banks had uniformly low information
levels and thus had the lowest capability of providing
measurement accuracy. The results also indicated
that if an examinee’s θ and item bank information
did not match, inaccurate θ estimates for examinees
with those θ levels were obtained. A more important
finding was that regardless of how large a sample
was used to calibrate the items, it seemed to have
no effect on improving the θ estimates when the
bank lacked sufficient information for a particular θ
range. This suggests that if the item bank does not
have sufficient information for a particular θ range,
calibration sample size cannot be used as a means to
increase θ estimation accuracy in that range in CAT.

The item banks used in the present study had more
information in θ Group 2 (θ = -1.99 to -1.00)
compared with θ Group 1. However, the results
for this group confirmed the findings in Group 1.
Although the item banks had more information for
this range, it was not sufficient to accurately estimate
θ. The correlations, RMSDs, and ASDs indicated
inaccurate θ estimates. This again was because of
the lack of sufficient information in the banks for
this θ range, as shown in Figure 1. However, the
item bank size had a somewhat positive effect on
the θ estimates in this range in that more items in
the item bank resulted in more information. The
results from Group 2 also confirmed the finding
that the calibration sample size did not improve
CAT θ estimates in θ ranges for which item banks
had little information.

Similar correlations, RMSDs, and ASDs were
obtained for examinees in Group 3 (θ = -0.99 to
0.00) across all item bank and calibration sample
sizes. Note that the item bank size lost the influence
on θ estimation that was clearly observed in
Groups 1 and 2. This was possibly because of the
higher level of information that item banks had in
Group 3, and it suggests that whether an examinee’s
θ falls into areas where an item bank has sufficient
information will determine the magnitude of the
effect of the item bank size on θ estimation accuracy.

The results from Group 4 (θ = 0.001 to 0.99) and 
Group 5 (θ = 1.00 to 2.05) confirmed the findings 
for Group 3; the correlations, RMSDs, and ASDs 
were very close to each other across all item banks 
and calibration sample sizes in these groups. It can 
be clearly seen that the information functions of the 
item banks illustrated in Figure 1 had their peaks 
at or around θ = 1.0. This means that the θ ranges 
of Groups 4 and 5 covered the area at which all 
item banks had their highest levels of information, 
which resulted in higher θ estimation accuracy, 
as indicated by the correlations, RMSDs, and 
ASDs. However, as in Group 3, the item bank and 
calibration sample sizes had almost no influence on 
the accuracy of θ estimates when there was more 
information in a bank that matched the target 
examinee's θ level. 
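
The group-wise comparison above can be sketched as follows. This is an illustrative computation only: it assumes that RMSD denotes the root mean squared difference and ASD the average signed difference between estimated and true θ, and the function name, bin edges, and data values are hypothetical.

```python
import numpy as np

def accuracy_by_group(theta_true, theta_hat, edges):
    """Correlation, RMSD, and ASD within half-open theta bins [lo, hi)."""
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta_true >= lo) & (theta_true < hi)
        diff = theta_hat[mask] - theta_true[mask]
        results.append({
            "range": (lo, hi),
            "n": int(mask.sum()),
            "r": float(np.corrcoef(theta_true[mask], theta_hat[mask])[0, 1]),
            "rmsd": float(np.sqrt(np.mean(diff ** 2))),  # assumed definition
            "asd": float(np.mean(diff)),                 # assumed definition
        })
    return results

# Hypothetical true and estimated theta values for six examinees
theta_true = np.array([-2.2, -2.1, -1.5, -1.2, 0.5, 0.8])
theta_hat = theta_true + np.array([0.4, 0.6, 0.2, -0.2, 0.1, -0.1])
edges = [-2.35, -2.0, -1.0, 2.05]   # illustrative bins, not the study's groups

res = accuracy_by_group(theta_true, theta_hat, edges)
for row in res:
    print(row)
```

Conditioning the indices on θ in this way is what makes the contrast between low-information and high-information ranges visible; a single index pooled over all examinees would hide it.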

The findings of the present study indicate that 
calibration sample size had little or no influence on 
θ estimation in CAT applications as long as the bank 
had sufficient levels of information in the regions 
of the banks where individual θs were located. In 
the present study, accurate θ parameter estimates 
were obtained even in research conditions in which 
a sample of 150 examinees was used to calibrate 
the items in the banks. Moreover, a bank of 100 
items could function as successfully as the banks 
with as many as 500 items, especially at certain θ 
levels, depending on the information that pertained 
to those θ levels in the banks. This suggests that a 
100-item bank could serve for some purposes if it 
was developed using quality items that provided 
information at appropriate θ levels. However, 
because a bank's information level is highly 
dependent on the number of items in it, having 
a bank of 200 items or more would serve better 
for most purposes and would decrease the risk of 
having inaccurate θ estimates in most situations. 
Moreover, in most situations in the present study, 
the 300-item bank served as well as the 500-item 
bank. This might mean that if a large item bank is 
planned to be developed, a bank of 300 items could 
be considered. 

It should also be kept in mind that these suggestions 
are valid as long as the item banks have sufficient 
information at the θ levels that match the target 
examinees. From that point of view, the item banks 
that were used in this study were not optimal for 
CAT, which requires a bank with an essentially 
horizontal information function for optimal 
performance. Different results likely would have 
been obtained conditional on θ for this type of bank. 
However, such banks are difficult to construct, 
and real item banks are likely to be somewhere in 
between these two extremes. 

The findings of the present study were partially 
consistent with the previous literature, which 
contains a limited number of similar studies. An item 
bank of 200 items calibrated on 2,000 examinees 
was what Ree (1981) suggested as necessary to 
obtain accurate θ estimates using CAT. In the 
present study, a bank of 200 items was found to 
be feasible for some purposes as well. However, 
a calibration sample of 150 examinees was also 
found to be feasible for obtaining reasonably 
good θ estimates in situations in which the item 
bank had sufficient information for the target θ. 
This difference in results is partly attributable to 
the parameter estimation methods that were used 
in 1981. In his study, Ree possibly (not reported 
in the article) used joint maximum likelihood 
estimation (JMLE), in which item and θ parameters 
are estimated concurrently, given that it was the 
only parameter estimation method available at that 
time. Moreover, it is known that JMLE works best 
when a large number of examinees, such as 1,000, 
and long tests of as many as 60 items are used in 
item parameter estimation (Baker & Kim, 2004). 
Thus, it was typical to have inaccurate estimates 
in small samples. The findings of the present study 
also confirmed the findings of Chuah et al. (2006), 
who found that a pre-calibration sample of 300 
examinees was sufficient for accurate estimation of 
examinee θ in CAST. However, the present results 
indicated that a pre-calibration sample of 150 could 
also be feasible if sufficient information existed in 
the bank for the target examinees’ θ levels. 

The purpose of the present study was to investigate 
how θ estimation in 3PLM CAT was affected by 
calibration sample size conditional on bank size. The 
results indicated that the θ estimates in CAT were 
robust against the calibration sample size and the 
bank size, especially when the item bank had high 
information that matched the target examinees’ θ 
levels. A sample of 150 examinees might be feasible 
for calibrating items for use in a CAT item bank, 
and an item bank of 200 high-quality items that 
provide high information across the θ continuum 
could also be useful for many purposes. These 
results contrast with the findings of prior research 
that suggested sample sizes of 1,000 or more for 
accurate item parameter estimates. The difference 
between the present study and prior research is 
that the prior research was mainly concerned with 
the accuracy of item parameter estimation in the 
3PLM, whereas the present research focused on 
the accuracy of the person parameter estimates 
derived through CAT. Apparently, whatever errors 
occur in item parameter estimates as the result of 
small samples and/or small item banks do not have 
a large influence on the person parameter estimates 
that result from CAT administration. 
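
One way to explore this robustness informally is to score the same response pattern twice, once with a set of item parameters and once with perturbed versions of them. The sketch below makes several assumptions that are not taken from the study: it uses expected a posteriori (EAP) scoring with a standard normal prior as one possible Bayesian estimator, a fixed six-item form rather than an adaptive administration, D = 1.7, and hypothetical parameter values and perturbations.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def eap(responses, a, b, c, n_quad=61):
    """EAP estimate of theta using a standard normal prior on a fixed grid."""
    grid = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * grid ** 2)               # unnormalized N(0, 1)
    p = p_3pl(grid[:, None], a, b, c)              # shape (n_quad, n_items)
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = like * prior
    return float(np.sum(grid * post) / np.sum(post))

# Hypothetical "true" item parameters and one response pattern
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1, 0.9])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
c = np.full(6, 0.2)
responses = np.array([1, 1, 1, 0, 1, 0])

# Hypothetical calibration error: noise of mixed sign in a, b, and c
a_hat = a * np.array([1.10, 0.90, 1.15, 0.95, 1.05, 0.90])
b_hat = b + np.array([0.20, -0.20, 0.10, -0.10, 0.20, -0.20])
c_hat = np.clip(c + np.array([0.05, -0.05, 0.03, -0.03, 0.05, -0.05]), 0.0, 0.35)

theta_from_true = eap(responses, a, b, c)
theta_from_noisy = eap(responses, a_hat, b_hat, c_hat)
print(theta_from_true, theta_from_noisy, abs(theta_from_true - theta_from_noisy))
```

Repeating this comparison over many simulated examinees and perturbation sizes would give a rough picture of how item parameter error propagates to θ estimates, although it would not reproduce the post-hoc CAT simulations of the study.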

The findings of the present study have some 
implications for future item bank development 
studies in various disciplines. They can provide 
an empirical basis for CAT researchers, decision 
makers, and educational institutions in countries 
where funding sources are limited and finding 
large numbers of examinees to calibrate items 
while developing an item bank for CAT is difficult. 
The present findings will be especially valuable for 
reducing the cost and time necessary to develop a 
CAT item bank. 

The findings of the present study are limited to the 
item banks that were used in the study and the item 
parameter estimates that were obtained by MMLE 
in the 3PLM. For this reason, they should only be 
extended to CAT applications and item banks under 
similar conditions. Moreover, because the 3PLM 
was used to calibrate items in the present study, 
the findings are limited to this unidimensional 
dichotomous model of IRT. The findings are also 
specific to the use of Bayesian θ estimation and 
should be replicated using maximum likelihood 
methods to estimate θ. 

A natural progression of the present study would 
be to design research studies to validate how the 